What is the intended goal?
The objective of this problem is to build time series models (AR, MA, ARMA, ARIMA, and SARIMA) that can forecast the CO2 emissions value for the natural gas (NNEIEUS) fuel type for the next 12 months, and to propose measures that could be adopted as policies to reduce these emissions.
This dataset contains past monthly carbon dioxide emissions from electricity generation, published by the US Energy Information Administration and categorized by fuel type, such as Coal, Natural Gas, etc.
MSN:- Reference to Mnemonic Series Names (U.S. Energy Information Administration Nomenclature)
YYYYMM:- The year and month in which the emissions were observed
Value:- Amount of CO2 Emissions in Million Metric Tons of Carbon Dioxide
Description:- The fuel category of electricity production through which the carbon is emitted
This notebook can be considered a guide to refer to while solving the problem. The evaluation will be as per the Rubric shared for each Milestone. Unlike previous courses, it does not follow the pattern of the graded questions in different sections. This notebook would give you a direction on what steps need to be taken in order to get a viable solution to the problem. Please note that this is just one way of doing this. There can be other 'creative' ways to solve the problem and we urge you to feel free and explore them as an 'optional' exercise.
In the notebook, there are markdown cells called - Observations and Insights. It is a good practice to provide observations and extract insights from the outputs.
The naming convention for different variables can vary. Please consider the code provided in this notebook as a sample code.
All the outputs in the notebook are just for reference and can be different if you follow a different approach.
There are sections called Think About It in the notebook that will help you get a better understanding of the reasoning behind a particular technique/step. Interested learners can take alternative approaches if they want to explore different techniques.
# Uncomment the line below to upgrade statsmodels
# !pip install statsmodels --upgrade
#Import basic libraries
import pandas as pd
import warnings
import itertools
import numpy as np
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from google.colab import drive
drive.mount('/content/drive')
df = pd.read_excel('/content/drive/My Drive/MER_T12_06.xlsx')
df.head()
# Ignore warnings
warnings.filterwarnings("ignore")
# Convert the "YYYYMM" column into standard datetime format and make it the index.
# errors='coerce' replaces values that cannot be parsed as dates with NaT.
dateparse = lambda x: pd.to_datetime(x, format='%Y%m', errors='coerce')
df = pd.read_excel('/content/drive/My Drive/MER_T12_06.xlsx', parse_dates=['YYYYMM'], index_col='YYYYMM', date_parser=dateparse)
df.head(15)
The 'YYYYMM' values that could not be parsed into valid dates were coerced to NaT, so we keep only the rows with a valid datetime index:
ts = df[pd.Series(pd.to_datetime(df.index, errors='coerce')).notnull().values]
ts.head()
#Check the datatypes of each column. Hint: Use dtypes method
print(ts.dtypes)
#convert the emission value into numeric value
ts["Value"] = pd.to_numeric(ts["Value"], errors='coerce')
#Check total number of missing values of each column. Hint: Use isnull() method
ts.isna().sum()
#Drop the missing value using dropna(inplace = True)
ts.dropna(inplace = True)
ts_per_source = ts.groupby('Description')
ts_per_source.head(5)
cols = ['Geothermal Energy', 'Non-Biomass Waste', 'Petroleum Coke','Distillate Fuel',
'Residual Fuel Oil', 'Petroleum', 'Natural Gas', 'Coal', 'Total Emissions']
cols_code=ts.Description.unique()
cols_code
cols_re_ordered = ['Coal', 'Natural Gas', 'Distillate Fuel', 'Petroleum Coke', 'Residual Fuel Oil', 'Petroleum', 'Geothermal Energy', 'Non-Biomass Waste',
'Total Emissions']
fig, ax = plt.subplots(figsize=(16,7))
for label, group in ts.groupby('Description'):
    group.Value.plot(ax=ax)
plt.legend(cols_re_ordered)
plt.title('Emission in the power generation over time')
plt.xlabel('Year', fontsize=15)
plt.show()
fig, ax = plt.subplots(figsize=(16,7))
cols_code = ts.Description.unique()
for label, group in ts.groupby('Description'):
    group.Value.plot(ax=ax)
plt.legend(cols_code)
plt.title('Emission in the power generation over time')
plt.xlabel('Year', fontsize=15)
plt.show()
###Code here
ts_grouped = ts.groupby('Description')
cols_code=ts.Description.unique()
cols_code
for col in cols_code:
    plt.figure(figsize=(16,8))
    plt.xlabel("Month")
    plt.ylabel(col)
    plt.title('Carbon Emission')
    plt.plot(ts_grouped.get_group(col).index, ts_grouped.get_group(col).Value, color='c', marker='.')
    plt.show()
CO2_per_source = ts.groupby('Description')['Value'].sum().sort_values().to_frame()
CO2_per_source.head(9)
cols = ['Geothermal Energy', 'Non-Biomass Waste', 'Petroleum Coke', 'Distillate Fuel',
        'Residual Fuel Oil', 'Petroleum', 'Natural Gas', 'Coal', 'Total Emissions']
cols_1 = ['Coal', 'Distillate Fuel', 'Geothermal Energy', 'Natural Gas', 'Non-Biomass Waste', 'Petroleum Coke', 'Petroleum', 'Residual Fuel Oil', 'Total Emissions']
CO2_per_source["Energy_Type"] = cols_1
CO2_per_source.head()
##Code here
CO2_per_source.Value.plot(kind='bar', figsize=(8,4))
plt.xlabel('Energy Source')
plt.show()
Emissions = ts.iloc[:,1:]
Emissions = Emissions.groupby(['Description', pd.Grouper(freq="M")])['Value'].sum().unstack(level=0)
mte = Emissions['Natural Gas Electric Power Sector CO2 Emissions']  # monthly total emissions (mte)
mte.head()
For developing the time series model and forecasting, you are expected to use the natural gas CO2 emissions from electric power generation. We need to slice the data accordingly:
#Importing library for date manipulation
from datetime import datetime
#To calculate the MSE or RMSE
from sklearn.metrics import mean_squared_error
#Importing acf and pacf functions
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
#Importing models from statsmodels library
from statsmodels.tsa.ar_model import AutoReg
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
###Slice the data to get the monthly total CO2 emissions of Natural Gas Electric Power Sector
Emissions = ts.iloc[:,1:]
Emissions = Emissions.groupby(['Description', pd.Grouper(freq="M")])['Value'].sum().unstack(level=0)
ts_mte = Emissions[['Natural Gas Electric Power Sector CO2 Emissions']].rename(columns={'Natural Gas Electric Power Sector CO2 Emissions':'Value'})
#Check the first few rows of data
ts_mte.head()
# Split the data into train and test
ts_train, ts_test = ts_mte.iloc[:-12], ts_mte.iloc[-12:]
print(ts_train)
print(ts_test)
#Import the required package
import statsmodels
import statsmodels.api as sm
from statsmodels.tsa.stattools import coint, adfuller
# Calculate the rolling mean and standard deviation for a window of 12 observations
rolmean=ts_train.rolling(window=12).mean()
rolstd=ts_train.rolling(window=12).std()
# Visualize the rolling mean and standard deviation
plt.figure(figsize=(16, 8))
actual = plt.plot(ts_train, color='c', label='Actual Series')
rollingmean = plt.plot(rolmean, color='red', label='Rolling Mean')
rollingstd = plt.plot(rolstd, color='green', label='Rolling Std.Dev.')
plt.title('Rolling Mean & Standard Deviation of the Series')
plt.legend()
plt.show()
Use the Augmented Dickey-Fuller (ADF) Test to verify if the series is stationary or not. The null and alternate hypotheses for the ADF Test are defined as:
- Null hypothesis: The Time Series is non-stationary
- Alternative hypothesis: The Time Series is stationary
#Define a function to run the Augmented Dickey-Fuller test
def adfuller(dataset):
    from statsmodels.tsa.stattools import adfuller
    print('Dickey-Fuller Test: ')
    adftest = adfuller(dataset['Value'])
    adfoutput = pd.Series(adftest[0:4], index=['Test Statistic', 'p-value', 'Lags Used', 'No. of Observations'])
    for key, value in adftest[4].items():
        adfoutput['Critical Value (%s)' % key] = value
    print(adfoutput)

adfuller(ts_train)
Observations and Insights
We can use some of the following methods to convert a non-stationary series into a stationary one:
One approach is a moving average: we take the average of 'k' consecutive values, where 'k' depends on the frequency of the time series (in this capstone, 12 months). Here, we will take the average over the past one year.
# Visualize the rolling mean and standard deviation after using log transformation
plt.figure(figsize=(16,8))
ts_log = np.log(ts_train)
MAvg = ts_log.rolling(window=12).mean()
MStd = ts_log.rolling(window=12).std()
plt.plot(ts_log)
plt.plot(MAvg, color='r', label = 'Moving Average')
plt.plot(MStd, color='g', label = 'Standard Deviation')
plt.legend()
plt.show()
Observations and Insights: _
Visualize the rolling mean and rolling standard deviation of the shifted series (df_shift) and check the stationarity by calling the adfuller() function. Also, write your observations on the same.
##Code here
plt.figure(figsize=(16,8))
df_shift = ts_train - ts_train.shift(periods=1)
MAvg_shift = df_shift.rolling(window=12).mean()
MStd_shift = df_shift.rolling(window=12).std()
plt.plot(df_shift, color='c')
plt.plot(MAvg_shift, color='red', label = 'Moving Average')
plt.plot(MStd_shift, color='green', label = 'Standard Deviation')
plt.legend()
plt.show()
df_shift = df_shift.dropna()
Observations and Insights: _
adfuller(df_shift)
Observations and Insights: _
from statsmodels.tsa.seasonal import seasonal_decompose
decomp = seasonal_decompose(ts_train)
trend = decomp.trend
seasonal = decomp.seasonal
residual = decomp.resid
plt.figure(figsize=(15,10))
plt.subplot(411)
plt.plot(ts_train, label='Actual', marker='.')
plt.legend(loc='upper left')
plt.subplot(412)
plt.plot(trend, label='Trend', marker='.')
plt.legend(loc='upper left')
plt.subplot(413)
plt.plot(seasonal, label='Seasonality', marker='.')
plt.legend(loc='upper left')
plt.subplot(414)
plt.plot(residual, label='Residuals', marker='.')
plt.legend(loc='upper left')
plt.tight_layout()
Plot the auto-correlation function (ACF) and partial auto-correlation function (PACF) to get the p and q values for the AR, MA, ARMA, and ARIMA models, and find the optimal parameters.
from statsmodels.graphics.tsaplots import plot_acf,plot_pacf
plt.figure(figsize = (16,8))
plot_acf(df_shift, lags = 24)
plt.show()
plot_pacf(df_shift, lags = 24)
plt.show()
Observations and Insights: _
Order p is the lag value after which the PACF plot crosses the upper confidence interval for the first time. These p lags will act as our features while forecasting the AR time series.
Fit and predict the shifted series with the AR Model and calculate the RMSE. Also, visualize the time series and write your observations.
#Importing AutoReg function to apply AR model
from statsmodels.tsa.ar_model import AutoReg
from sklearn.metrics import mean_squared_error
#Code here
plt.figure(figsize=(16,5))
model_AR = AutoReg(df_shift, lags=1) #Using number of lags as 1
results_AR = model_AR.fit()
plt.plot(df_shift)
predict = results_AR.predict(start=0,end=len(df_shift)-1)
predict = predict.fillna(0) #Converting NaN values to 0
plt.plot(predict, color='red')
plt.title('AR Model - RMSE: %.4f'% mean_squared_error(predict, df_shift['Value'], squared=False))
plt.show()
Observations & Insights: _
#check the AIC value of the AR model
results_AR.aic
Order q of the MA process is obtained from the ACF plot; it is the lag after which the ACF crosses the upper confidence interval for the first time.
Fit and predict the shifted series with the MA Model and calculate the RMSE. Also, visualize the time series and write your observations.
#Code here
plt.figure(figsize=(16,5))
model_MA = ARIMA(df_shift, order=(0, 0, 1)) #Using p=0, d=0, q=1
results_MA = model_MA.fit()
plt.plot(df_shift)
plt.plot(results_MA.fittedvalues, color='red')
plt.title('MA Model - RMSE: %.4f'% mean_squared_error(results_MA.fittedvalues, df_shift['Value'], squared=False))
plt.show()
Observations & Insights: _
#check the AIC value of the MA Model
results_MA.aic
We will use the AR lag (p) and MA lag (q) obtained above as parameters, with d=0, so that the ARIMA model works as an ARMA model.
Fit and predict the shifted series with the ARMA Model and calculate the RMSE. Also, visualize the time series and write your observations.
# code here
plt.figure(figsize=(16,5))
model_ARMA = ARIMA(df_shift, order=(1, 0, 1)) #Using p=1, d=0, q=1
results_ARMA = model_ARMA.fit()
plt.plot(df_shift)
plt.plot(results_ARMA.fittedvalues, color='red')
plt.title('ARMA Model - RMSE: %.4f'% mean_squared_error(results_ARMA.fittedvalues, df_shift['Value'], squared=False))
plt.show()
Observations & Insights: _
Check the AIC value of the model
#code here
results_ARMA.aic
Fit and predict the shifted series with the ARIMA Model and calculate the RMSE. Also, visualize the time series and write your observations.
# code here
plt.figure(figsize=(16,5))
model_ARIMA = ARIMA(ts_train, order=(1,1,1)) #Using p=1, d=1, q=1
results_ARIMA = model_ARIMA.fit()
plt.plot(ts_train)
plt.plot(results_ARIMA.fittedvalues, color='red')
plt.title('ARIMA Model - RMSE: %.4f'% mean_squared_error(results_ARIMA.fittedvalues, ts_train['Value'], squared=False))
plt.show()
Observations & Insights:__
Check the AIC value of the model
results_ARIMA.aic
# Printing the fitted values from arima
predictions=pd.Series(results_ARIMA.fittedvalues)
predictions.head()
Use the correct inverse transformation, depending on the transformations applied before modeling, to get back the original values. Apply the inverse transformation to the predictions of the chosen model.
#Add the code blocks based on the requirements
#Apply the ARIMA
#predictions = predictions.cumsum()
#predictions_log = pd.Series(ts_log['Value'].iloc[0], index=ts_log.index)
#predictions_log = predictions_log.add(predictions, fill_value=0)
#predictions_ARIMA = np.exp(predictions_log)
predictions_ARIMA = predictions.copy()
predictions_ARIMA.head()
predictions_ARIMA.dtype
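With the newer `statsmodels.tsa.arima.model.ARIMA`, the fitted values of a model with d=1 are already reported on the original scale, which is why the cumsum/exp inversion above is left commented out. For the models fit on the manually differenced series, the inversion is a cumulative sum plus the first observation. A minimal sketch on made-up data (the names `original` and `diffed` are illustrative, not the emissions series):

```python
import numpy as np
import pandas as pd

# Illustrative series, not the emissions data
rng = np.random.default_rng(1)
original = pd.Series(100 + np.cumsum(rng.normal(size=10)))

# First-order difference, as used for the AR/MA/ARMA models above
diffed = original.diff().dropna()

# Invert the differencing: cumulatively sum the differences, add back the first value
recovered = diffed.cumsum() + original.iloc[0]
assert np.allclose(recovered.values, original.iloc[1:].values)
```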
#Apply the ARMA model
#predictions=pd.Series(results_ARMA.fittedvalues)
#predictions_cumsum = predictions.cumsum()
#predictions_log = pd.Series(ts_log['Value'].iloc[0], index=ts_log.index)
#predictions_log = predictions_log.add(predictions_cumsum, fill_value=0)
#predictions_ARMA = np.exp(predictions_log)
#predictions_ARMA.head()
#plt.figure(figsize=(16,5))
#plt.plot(ts_train, color = 'c', label = 'Original Series')
#plt.plot(predictions_ARMA, color = 'r', label = 'Predicted Series')
#plt.title('Actual vs Predicted')
#plt.legend()
#plt.show()
Plot the original vs predicted series
#Code here
plt.figure(figsize=(16,5))
plt.plot(ts_train, color = 'c', label = 'Original Series')
plt.plot(predictions_ARIMA, color = 'r', label = 'Predicted Series')
plt.title('Actual vs Predicted')
plt.legend()
plt.show()
Observations & Insights: _
#Add the code blocks based on the requirements
forecasted_ARIMA = results_ARIMA.forecast(steps=24)
forecasted_ARIMA
plt.figure(figsize=(16,8))
plt.plot(ts_mte, color = 'c', label = 'Original Series')
plt.plot(predictions_ARIMA, color = 'r', label = 'Prediction on Train data')
plt.plot(forecasted_ARIMA, label = 'Prediction on Test data', color='b')
plt.title('Actual vs Predicted')
plt.legend()
plt.show()
error_predictions_ARIMA = mean_squared_error(predictions_ARIMA, ts_train, squared= False)
error_predictions_ARIMA
error_forecasted_ARIMA = mean_squared_error(forecasted_ARIMA.iloc[:12,], ts_test, squared = False)
error_forecasted_ARIMA
Can we use other forecasting methods such as SARIMA to improve our model performance?
# Attempting SARIMA
#Differencing the series
plt.figure(figsize=(16,8))
df_seas = df_shift - df_shift.shift(periods = 12)
MAvg_seas = df_seas.rolling(window=12).mean()
MStd_seas = df_seas.rolling(window=12).std()
plt.plot(df_seas, color='c')
plt.plot(MAvg_seas, color='red', label = 'Moving Average')
plt.plot(MStd_seas, color='green', label = 'Standard Deviation')
plt.legend()
plt.show()
df_seas = df_seas.dropna()
plt.figure(figsize = (18,10))
plot_acf(df_seas, lags = 36)
plt.show()
plot_pacf(df_seas, lags = 20)
plt.show()
# p = 1
# q = 1
# Fitting SARIMA(0,1,0)(1,1,0)12
plt.figure(figsize=(16,8))
model_SARIMA = SARIMAX(ts_train, order=(0,1,0), seasonal_order=(1,1,0,12))
results_SARIMA = model_SARIMA.fit()
plt.plot(ts_train)
plt.plot(results_SARIMA.fittedvalues, color='red')
plt.title('SARIMA Model - RMSE: %.4f'% mean_squared_error(results_SARIMA.fittedvalues,ts_train['Value'], squared=False))
plt.show()
results_SARIMA.aic
spredictions=pd.Series(results_SARIMA.fittedvalues)
predictions_SARIMA = spredictions.copy()
predictions_SARIMA.head()
#Plotting the original vs predicted series
plt.figure(figsize=(16,8))
plt.plot(ts_train, color = 'c', label = 'Original Series')
plt.plot(predictions_SARIMA, color = 'r', label = 'Predicted Series')
plt.title('Actual vs Predicted')
# plt.ylim(0,200)
plt.legend()
plt.show()
#Forecasting the values for next 24 months
forecasted_SARIMA = results_SARIMA.forecast(steps=24) # here steps represent the number of months
# forecasted_SARIMA = np.exp(forecasted_SARIMA)
forecasted_SARIMA
error_predictions_SARIMA = mean_squared_error(predictions_SARIMA, ts_train, squared = False)
error_predictions_SARIMA
error_forecasted_SARIMA = mean_squared_error(forecasted_SARIMA.iloc[:12,], ts_test, squared = False)
error_forecasted_SARIMA
What model do you propose to be adopted? Why is this the best solution to adopt?
We propose adopting the SARIMA model for forecasting CO2 emissions from natural gas.
Using ARIMA, we get RMSE = 2.9985 with AIC = 2566.2613.
Using SARIMA, we get RMSE = 2.1916 with AIC = 2161.1164.
Applying the ARIMA model, we get RMSE = 2.9985 on the original training data and RMSE = 5.5671 on the test data.
In conclusion, the SARIMA model shows the lowest RMSE and the lowest AIC, on both the training and the test data.
Other problems remain to be solved: we need to understand the price of each energy source and connect it to the CO2 emissions. We should also understand consumers' past monthly spend and monthly usage.
Executive Summary
Problem and Solution Summary
Recommendations for Implementation
What other problems need to be explored and in what priority order?
Emissions = ts.iloc[:,1:]
Emissions = Emissions.groupby(['Description', pd.Grouper(freq="M")])['Value'].sum().unstack(level = 0)
#Emissions.drop('Total Energy Electric Power Sector CO2 Emissions')
Emissions.head()
##ts_mte_all = Emissions[['Coal Electric Power Sector CO2 Emissions', 'Distillate Fuel', 'Natural Gas', 'Non-Biomass Waste Electric Power Sector CO2 Emissions', 'Petroleum Coke Electric Power Sector CO2 Emissions', 'Petroleum Electric Power Sector CO2 Emissions', 'Residual Fuel Oil Electric Power Sector CO2 Emissions', 'Total Energy Electric Power Sector CO2 Emissions']]
##ts_mte_all = Emissions.rename(columns={'Coal Electric Power Sector CO2 Emissions':'Coal', 'Distillate Fuel, Including Kerosene-Type Jet Fuel, Oil Electric Power Sector CO2 Emissions':'Distillate Fuel', 'Geothermal Energy Electric Power Sector CO2 Emissions':'Geothermal Energ', 'Natural Gas Electric Power Sector CO2 Emissions':'Natural Gas', 'Non-Biomass Waste Electric Power Sector CO2 Emissions':'Non-Biomass Waste', 'Petroleum Coke Electric Power Sector CO2 Emissions':'Petroleum Coke', 'Petroleum Electric Power Sector CO2 Emissions':'Petroleum', 'Residual Fuel Oil Electric Power Sector CO2 Emissions':'Residual Fuel Oil', 'Total Energy Electric Power Sector CO2 Emissions': 'Total Energy'}, inplace=True)
#ts_mte_sources = Emissions[['Coal Electric Power Sector CO2 Emissions', 'Petroleum Coke Electric Power Sector CO2 Emissions']]
#ts_mte_sources= Emissions[['Coal Electric Power Sector CO2 Emissions', 'Petroleum Coke Electric Power Sector CO2 Emissions']].rename(columns={'Coal Electric Power Sector CO2 Emissions':'Coal', 'Petroleum Coke Electric Power Sector CO2 Emissions':'Petroleum Coke'})
#ts_mte_sources.head()
#ts_train, ts_test = ts_mte_sources.iloc[:-12], ts_mte_sources.iloc[-12:]
#print(ts_train)
#print(ts_test)
import statsmodels
import statsmodels.api as sm
from statsmodels.tsa.stattools import coint, adfuller
ts_mte_coal = Emissions['Coal Electric Power Sector CO2 Emissions']
ts_mte_coal = Emissions[['Coal Electric Power Sector CO2 Emissions']].rename(columns = {'Coal Electric Power Sector CO2 Emissions':'Coal'})
ts_mte_coal.head()
ts_train_coal, ts_test_coal = ts_mte_coal.iloc[:-12], ts_mte_coal.iloc[-12:]
print(ts_train_coal)
print(ts_test_coal)
decomp = seasonal_decompose(ts_train_coal)
trend = decomp.trend
seasonal = decomp.seasonal
residual = decomp.resid
plt.figure(figsize=(20,10))
plt.subplot(411)
plt.plot(ts_train_coal, label='Actual', marker='.')
plt.legend(loc='upper left')
plt.subplot(412)
plt.plot(trend, label='Trend', marker='.')
plt.legend(loc='upper left')
plt.subplot(413)
plt.plot(seasonal, label='Seasonality', marker='.')
plt.legend(loc='upper left')
plt.subplot(414)
plt.plot(residual, label='Residuals', marker='.')
plt.legend(loc='upper left')
plt.tight_layout()
def adfuller(dataset):
    from statsmodels.tsa.stattools import adfuller
    print('Dickey-Fuller Test: ')
    adftest = adfuller(dataset['Coal'])
    adfoutput = pd.Series(adftest[0:4], index=['Test Statistic', 'p-value', 'Lags Used', 'No. of Observations'])
    for key, value in adftest[4].items():
        adfoutput['Critical Value (%s)' % key] = value
    print(adfoutput)

adfuller(ts_train_coal)
plt.figure(figsize=(16,8))
df_shift_coal = ts_train_coal - ts_train_coal.shift(periods=1)
MAvg_shift = df_shift_coal.rolling(window=12).mean()
MStd_shift = df_shift_coal.rolling(window=12).std()
plt.plot(df_shift_coal, color='c')
plt.plot(MAvg_shift, color='red', label = 'Moving Average')
plt.plot(MStd_shift, color='green', label = 'Standard Deviation')
plt.legend()
plt.show()
df_shift_coal= df_shift_coal.dropna()
adfuller(df_shift_coal)
plt.figure(figsize = (16,8))
plot_acf(df_shift_coal, lags = 24)
plt.show()
plot_pacf(df_shift_coal, lags = 24)
plt.show()
plt.figure(figsize=(16,8))
df_seas_coal = df_shift_coal - df_shift_coal.shift(periods =12)
MAvg_seas = df_seas_coal.rolling(window=12).mean()
MStd_seas = df_seas_coal.rolling(window=12).std()
plt.plot(df_seas_coal, color='c')
plt.plot(MAvg_seas, color='red', label = 'Moving Average')
plt.plot(MStd_seas, color='green', label = 'Standard Deviation')
plt.legend()
plt.show()
df_seas_coal = df_seas_coal.dropna()
plt.figure(figsize=(16,8))
model_SARIMA = SARIMAX(ts_train_coal, order=(0,1,0), seasonal_order=(1,1,0,12))
results_SARIMA = model_SARIMA.fit()
plt.plot(ts_train_coal)
plt.plot(results_SARIMA.fittedvalues, color='red')
plt.title('SARIMA Model - RMSE: %.4f'% mean_squared_error(results_SARIMA.fittedvalues,ts_train_coal['Coal'], squared=False))
plt.show()
results_SARIMA.aic
plt.figure(figsize=(16,5))
model_ARIMA = ARIMA(ts_train_coal, order=(1,1,1)) #Using p=1, d=1, q=1
results_ARIMA = model_ARIMA.fit()
plt.plot(ts_train_coal)
plt.plot(results_ARIMA.fittedvalues, color='red')
plt.title('ARIMA Model - RMSE: %.4f'% mean_squared_error(results_ARIMA.fittedvalues, ts_train_coal['Coal'], squared=False))
plt.show()
results_ARIMA.aic
plt.figure(figsize=(16,5))
model_ARMA = ARIMA(df_shift_coal, order=(1, 0, 1)) #Using p=1, d=0, q=1
results_ARMA = model_ARMA.fit()
plt.plot(df_shift_coal)
plt.plot(results_ARMA.fittedvalues, color='red')
plt.title('ARMA Model - RMSE: %.4f'% mean_squared_error(results_ARMA.fittedvalues, df_shift_coal['Coal'], squared=False))
plt.show()
results_ARMA.aic
ts_mte_PetroleumCoke = Emissions['Petroleum Coke Electric Power Sector CO2 Emissions']
ts_mte_PetroleumCoke = Emissions[['Petroleum Coke Electric Power Sector CO2 Emissions']].rename(columns = {'Petroleum Coke Electric Power Sector CO2 Emissions': 'Petroleum Coke'})
ts_mte_PetroleumCoke.head()
ts_train_PetroleumCoke, ts_test_PetroleumCoke = ts_mte_PetroleumCoke.iloc[:-12], ts_mte_PetroleumCoke.iloc[-12:]
print(ts_train_PetroleumCoke)
print(ts_test_PetroleumCoke)
from statsmodels.tsa.seasonal import seasonal_decompose
decomp = seasonal_decompose(ts_train_PetroleumCoke)
trend = decomp.trend
seasonal = decomp.seasonal
residual = decomp.resid
plt.figure(figsize=(18,10))
plt.subplot(411)
plt.plot(ts_train_PetroleumCoke, label='Actual', marker='.')
plt.legend(loc='upper left')
plt.subplot(412)
plt.plot(trend, label='Trend', marker='.')
plt.legend(loc='upper left')
plt.subplot(413)
plt.plot(seasonal, label='Seasonality', marker='.')
plt.legend(loc='upper left')
plt.subplot(414)
plt.plot(residual, label='Residuals', marker='.')
plt.legend(loc='upper left')
plt.tight_layout()
def adfuller(dataset):
    from statsmodels.tsa.stattools import adfuller
    print('Dickey-Fuller Test: ')
    adftest = adfuller(dataset['Petroleum Coke'])
    adfoutput = pd.Series(adftest[0:4], index=['Test Statistic', 'p-value', 'Lags Used', 'No. of Observations'])
    for key, value in adftest[4].items():
        adfoutput['Critical Value (%s)' % key] = value
    print(adfoutput)

adfuller(ts_train_PetroleumCoke)
plt.figure(figsize=(16,8))
df_shift_PetroleumCoke = ts_train_PetroleumCoke - ts_train_PetroleumCoke.shift(periods=12)
MAvg_shift = df_shift_PetroleumCoke.rolling(window=12).mean()
MStd_shift = df_shift_PetroleumCoke.rolling(window=12).std()
plt.plot(df_shift_PetroleumCoke, color='c')
plt.plot(MAvg_shift, color='red', label = 'Moving Average')
plt.plot(MStd_shift, color='green', label = 'Standard Deviation')
plt.legend()
plt.show()
df_shift_PetroleumCoke.dropna(inplace=True)
adfuller(df_shift_PetroleumCoke)
plt.figure(figsize=(16,8))
ts_log_PetroleumCoke = np.log(ts_train_PetroleumCoke)
MAvg = ts_log_PetroleumCoke.rolling(window=12).mean()
MStd = ts_log_PetroleumCoke.rolling(window=12).std()
plt.plot(ts_log_PetroleumCoke)
plt.plot(MAvg, color='r', label = 'Moving Average')
plt.plot(MStd, color='g', label = 'Standard Deviation')
plt.legend()
plt.show()
adfuller(ts_log_PetroleumCoke)
plt.figure(figsize = (16,8))
plot_acf(df_shift_PetroleumCoke, lags = 24)
plt.show()
plot_pacf(df_shift_PetroleumCoke, lags = 24)
plt.show()
df_seas_PetroleumCoke = df_shift_PetroleumCoke - df_shift_PetroleumCoke.shift(periods =12)
MAvg_seas = df_seas_PetroleumCoke.rolling(window=12).mean()
MStd_seas = df_seas_PetroleumCoke.rolling(window=12).std()
plt.plot(df_seas_PetroleumCoke, color='c')
plt.plot(MAvg_seas, color='red', label = 'Moving Average')
plt.plot(MStd_seas, color='green', label = 'Standard Deviation')
plt.legend()
plt.show()
plt.figure(figsize=(16,8))
model_SARIMA = SARIMAX(ts_train_PetroleumCoke, order=(0,1,0), seasonal_order=(1,1,0,12))
results_SARIMA = model_SARIMA.fit()
plt.plot(ts_train_PetroleumCoke)
plt.plot(results_SARIMA.fittedvalues, color='red')
plt.title('SARIMA Model - RMSE: %.4f'% mean_squared_error(results_SARIMA.fittedvalues,ts_train_PetroleumCoke['Petroleum Coke'], squared=False))
plt.show()
results_SARIMA.aic
plt.figure(figsize=(16,5))
model_ARIMA = ARIMA(ts_train_PetroleumCoke, order=(1,1,1)) #Using p=1, d=1, q=1
results_ARIMA = model_ARIMA.fit()
plt.plot(ts_train_PetroleumCoke)
plt.plot(results_ARIMA.fittedvalues, color='red')
plt.title('ARIMA Model - RMSE: %.4f'% mean_squared_error(results_ARIMA.fittedvalues, ts_train_PetroleumCoke['Petroleum Coke'], squared=False))
plt.show()
results_ARIMA.aic
plt.figure(figsize=(16,5))
model_ARMA = ARIMA(df_shift_PetroleumCoke, order=(1, 0, 1)) #Using p=1, d=0, q=1
results_ARMA = model_ARMA.fit()
plt.plot(df_shift_PetroleumCoke)
plt.plot(results_ARMA.fittedvalues, color='red')
plt.title('ARMA Model - RMSE: %.4f'% mean_squared_error(results_ARMA.fittedvalues, df_shift_PetroleumCoke['Petroleum Coke'], squared=False))
plt.show()
results_ARMA.aic